Testing, explaining, and exploring models of facial expressions of emotions

Models are the hallmark of mature scientific inquiry. In psychology, this maturity has been reached in a pervasive question—what models best represent facial expressions of emotion? Several hypotheses propose different combinations of facial movements [action units (AUs)] as best representing the six basic emotions and four conversational signals across cultures. We developed a new framework to formalize such hypotheses as predictive models, compare their ability to predict human emotion categorizations in Western and East Asian cultures, explain the causal role of individual AUs, and explore updated, culture-accented models that improve performance by reducing a prevalent Western bias. Our predictive models also provide a noise ceiling to inform the explanatory power and limitations of different factors (e.g., AUs and individual differences). Thus, our framework provides a new approach to test models of social signals, explain their predictive power, and explore their optimization, with direct implications for theory development.

Step 2: Embedding stimuli. In the previous section, we outlined how to formalize AU-emotion models as points (or, equivalently, vectors) in AU space. One way to evaluate these formalized models is to compare their predictions against actual categorical emotion labels from human participants in response to stimuli with known AU configurations.
In reference to our previously defined hypothetical 10-dimensional AU space, assume that we have categorical emotion labels $e \in E$ (where $E$ is the set of emotions) in response to a collection of $N$ facial expression stimuli parameterized with random AU configurations, drawn from the same 10-dimensional AU space discussed before. With such data, we can encode the stimuli in AU space by quantifying each stimulus, $S_i$, as a 10-dimensional "stimulus vector" containing nonzero values at positions associated with active AUs for that stimulus and zeros elsewhere. Note that, as is the case with model vectors, the stimulus vector's nonzero values at positions associated with active AUs can be all ones (if the AUs are assumed to be equally "active") or values proportional to the amplitude (or "activity") of the active AUs. For example, suppose that stimulus $S_i$ contains AU1, AU5, and AU26 with amplitudes 0.1, 0.5, and 0.8, respectively (assuming a maximum amplitude of 1). Then, formally, we can represent this stimulus, $S_i$, as the following stimulus vector (with the nonzero entries at the dimensions corresponding to AU1, AU5, and AU26):

$$S_i = [\underbrace{0.1}_{\mathrm{AU1}},\ 0,\ 0,\ \underbrace{0.5}_{\mathrm{AU5}},\ 0,\ 0,\ 0,\ 0,\ 0,\ \underbrace{0.8}_{\mathrm{AU26}}]$$

In case of multiple stimuli ($S_1, S_2, \dots, S_N$), their stimulus vectors can be vertically stacked into a single $N \times D$ "stimulus matrix", $S$. Given that both a model ($M$) and a set of stimuli ($S$) are encoded as matrices in the same $D$-dimensional AU space, we can discuss using kernels to generate quantitative predictors for stimuli given a particular theory.
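To make this concrete, the following is a minimal sketch in Python of how stimuli might be encoded as a stimulus matrix. The specific AU ordering in AU_SPACE and the helper name encode_stimulus are illustrative assumptions, not taken from the original analysis code:

```python
import numpy as np

# Hypothetical 10-dimensional AU space; this particular ordering is an
# illustrative assumption.
AU_SPACE = ["AU1", "AU2", "AU4", "AU5", "AU6", "AU9", "AU12", "AU15", "AU25", "AU26"]

def encode_stimulus(active_aus, n_dims=len(AU_SPACE)):
    """Encode one stimulus as a D-dimensional vector of AU amplitudes."""
    vec = np.zeros(n_dims)
    for au, amplitude in active_aus.items():
        vec[AU_SPACE.index(au)] = amplitude
    return vec

# Stimulus S_i with AU1, AU5, and AU26 at amplitudes 0.1, 0.5, and 0.8
s_i = encode_stimulus({"AU1": 0.1, "AU5": 0.5, "AU26": 0.8})

# Multiple stimulus vectors stack vertically into an N x D stimulus matrix S
S = np.vstack([s_i, encode_stimulus({"AU6": 0.7, "AU12": 1.0})])
print(S.shape)  # (2, 10)
```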
Step 3: kernel functions. Kernel functions ($\kappa$) measure the similarity between two vectors. We use them to quantify the similarity ($s_{ij}$) between an AU stimulus ($S_i$) and the AU model ($M_j$) for a specific emotion (e.g., happy):

$$s_{ij} = \kappa(S_i, M_j)$$

Most linear kernel functions use a variant of the inner product between the two vectors. Here, we used the cosine kernel, which normalizes the inner product between two vectors by the product of their norms:

$$\kappa(S_i, M_j) = \frac{S_i \cdot M_j}{\lVert S_i \rVert \, \lVert M_j \rVert}$$

Note that this model configuration is equivalent to a K-nearest neighbor classifier in which each AU vector ($M_j$) functions as a neighbor (i.e., K = C). Also, instead of using measures of similarity between two vectors (i.e., "kernels"), one could use measures of distance ($d_{ij}$) between two vectors and subsequently invert them to obtain a similarity score again, i.e., $s_{ij} = d_{ij}^{-1}$. In practice, we find that this makes little difference in terms of predictive performance (see Table S4).
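A minimal sketch of the cosine kernel applied to a stimulus matrix and a model matrix, assuming the matrix encoding from Step 2 (function and variable names are illustrative):

```python
import numpy as np

def cosine_kernel(S, M):
    """Cosine similarity between each stimulus (rows of S, shape N x D)
    and each emotion model vector (rows of M, shape C x D).
    Returns an N x C similarity matrix with entries s_ij."""
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    M_norm = M / np.linalg.norm(M, axis=1, keepdims=True)
    return S_norm @ M_norm.T  # inner products of unit-normalized rows

# Example: 2 stimuli x 10 AUs against 6 emotion models x 10 AUs
rng = np.random.default_rng(0)
sims = cosine_kernel(rng.uniform(size=(2, 10)), rng.uniform(size=(6, 10)))
print(sims.shape)  # (2, 6)
```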
Step 4: computing predictions. To generate a quantitative prediction for stimulus $S_i$ (i.e., $\hat{y}_i$), we still need a decision function that maps the data to a prediction. To generate a discrete prediction, we can take as prediction the emotion (across C classes) that maximizes the kernel function's similarity to the stimulus:

$$\hat{y}_i = \arg\max_{j \in \{1, \dots, C\}} \kappa(S_i, M_j)$$

In our analyses, we instead generate probabilistic predictions by normalizing the similarity vector with the softmax function. The resulting vector sums to 1 and its elements can be interpreted as probabilities (i.e., of an emotion given a stimulus and model matrix):

$$P(e_j \mid S_i, M) = \frac{\exp(\beta s_{ij})}{\sum_{k=1}^{C} \exp(\beta s_{ik})}$$

where $\beta$ is the "inverse temperature" parameter, a scaling parameter that distributes relatively more mass onto the largest values within the sequence of similarities. We treat this parameter as a model hyperparameter (i.e., one that could be manually tuned using cross-validation) that we set to 1.
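A minimal, self-contained sketch of this decision step (the similarity matrix here is randomly generated purely for illustration):

```python
import numpy as np

def softmax(sims, beta=1.0):
    """Convert an N x C similarity matrix into row-wise probabilities.
    beta is the inverse-temperature hyperparameter (set to 1 in our analyses)."""
    z = beta * sims
    z -= z.max(axis=1, keepdims=True)  # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sims = rng.uniform(size=(5, 6))   # e.g., 5 stimuli x 6 emotion similarities
probs = softmax(sims)             # probabilistic predictions (rows sum to 1)
y_hat = probs.argmax(axis=1)      # discrete predictions via the argmax rule
```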
Step 5: quantifying model performance. To evaluate the performance of each model, we compared their (discrete or probabilistic) emotion predictions for a set of stimuli with the actual participant responses. We quantified model performance, or "score", with a function $q$ that takes as inputs a set of predicted labels ($\hat{y}$) and a set of "true" labels ($y$) and returns the model performance:

$$\mathrm{score} = q(\hat{y}, y)$$

Instead of a single, class-average model performance estimate, some metrics return class-specific model performance scores. Our analyses use the area under the curve of the receiver operating characteristic (AUROC), which summarizes the quality of probabilistic predictions on a scale from 0 to 1 (0.5 is chance level; 1 is perfect prediction). We prefer AUROC because it applies to both discrete and probabilistic predictions, it is insensitive to class imbalance (i.e., unequal frequencies of target classes), and it allows for class-specific performance estimates (for C > 2).
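A minimal sketch of class-average and class-specific AUROC scoring using scikit-learn, with mock predictions and labels standing in for real data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_stim, n_classes = 100, 6
probs = rng.dirichlet(np.ones(n_classes), size=n_stim)  # mock probabilistic predictions
y_true = rng.integers(0, n_classes, size=n_stim)        # mock participant labels

# Class-average AUROC (one-vs-rest): 0.5 = chance, 1.0 = perfect prediction
score = roc_auc_score(y_true, probs, multi_class="ovr",
                      average="macro", labels=np.arange(n_classes))

# Class-specific AUROC: score each emotion against all others
per_class = [roc_auc_score(y_true == c, probs[:, c]) for c in range(n_classes)]
```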

Noise ceiling estimation
Suppose that, for a given dataset, we find that a particular model yields a (class-average) AUROC score of 0.8. What should we conclude from this score? It is above chance level (a score of 0.5) but also below perfect performance (a score of 1). Here, we interpret performance relative to a noise ceiling that represents an upper bound incorporating the within- and between-participant variance in class labels. A noise ceiling estimates an upper bound for predictive models that is adjusted for the consistency of the target variable.
The noise ceiling partitions unexplained variance into in-principle explainable variance (i.e., the noise ceiling minus the model performance) and unexplainable variance (i.e., the theoretical maximum performance minus the noise ceiling; see Figure 2). The amount of explainable variance in turn quantifies how much there is to gain from model improvement: if it is large, one might consider different, more complex, or differently parameterized models; if it is small (i.e., the model performance is at or near the noise ceiling), one can conclude that the model cannot be improved any further (which does not mean that it is the correct model, however). With AU-emotion models, the noise ceiling quantifies the portion of variance in emotion inferences that can be explained by AUs.
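As a worked example (with illustrative numbers), suppose a model scores an AUROC of 0.80 against a noise ceiling of 0.90 and a theoretical maximum of 1.0:

$$\underbrace{0.90 - 0.80}_{\text{explainable}} = 0.10, \qquad \underbrace{1.0 - 0.90}_{\text{unexplainable}} = 0.10$$

Here, half of the remaining gap to perfect performance could in principle be closed by a better model; the other half cannot, given the inconsistency of the responses.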
Noise ceilings are routinely estimated in systems and cognitive neuroscience (8, 9). However, existing methods are limited to regression models, which assume a continuous target variable (e.g., a brain measure). Our study deals with classification models, as we are trying to predict a categorical target variable (i.e., categorical emotion labels). We therefore propose a novel approach to estimate a noise ceiling for the predictive performance of classification models.
To estimate a noise ceiling, we need repeated trials to estimate the variance (or, inversely, the "consistency") of the target variable and thereby an upper bound on predictive performance. A noise ceiling formalizes the idea that a model can only perform as well as the consistency of responses allows. Trials are considered "repeats" if their representation in the model is the same; here, stimuli with the same AU configuration. Moreover, repeated trials may occur within or between participants, yielding a within- or between-participant noise ceiling when working with a within- or between-participant model. Here, we only considered between-participant variance (as our dataset only contains between-participant repeats).
To illustrate noise ceiling computation for a between-participant model, consider the following minimal example. Suppose three participants labeled the same two facial expression stimuli (S1 and S2) as outlined in Table S7. The noise ceiling (nc) represents an upper bound on predictive performance for a given set of observations, i.e., the performance that an optimal model would obtain:

$$\mathrm{nc} = q(\hat{y}^{*}, y)$$

where $\hat{y}^{*}$ denotes the predictions of the optimal model. Here, the optimal model can be any type of model constrained to make the same predictions for repeated observations. To a model, repeated observations represent identical input that should logically be predicted identically.
When predictions are discrete (i.e., a single label per trial), the optimal model predicts the mode across repeated observations. In our example, the optimal model would predict S1 as "Anger" and S2 as "Disgust." The noise ceiling is subsequently computed as the performance (e.g., with AUROC or a simple accuracy metric) of the optimal model given the true labels. In the example, the optimal model correctly predicts 2 out of 3 labels per stimulus, yielding a class-average AUROC noise ceiling of 0.6667.
Discrete predictions can result in multiple modes (e.g., a given stimulus might be labeled "anger" by 50% of participants and "disgust" by the other 50%). One could randomly pick one of these modes as the optimal prediction, which could arbitrarily impact the class-specific noise ceiling for the classes represented by the other modes. Alternatively, we suggest using probabilistic instead of discrete predictions. With probabilistic predictions, the optimal model does not predict the mode, but a probability distribution across labels equal to the proportion of each label. For stimulus $S_i$ with $R$ repeats, the probability of each class, $P(e_j \mid S_i)$, is computed as the proportion of labels, $y_{ir}$, equal to that class:

$$P(e_j \mid S_i) = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}[y_{ir} = e_j]$$

In our example, the optimal prediction for each repetition of S1 is [0.667, 0.333] for "Anger" and "Disgust", and the optimal prediction for each repetition of S2 is also [0.667, 0.333]. The class-average AUROC noise ceiling would coincidentally be the same as with discrete predictions, 0.6667.

Fig. S1. Difference in performance between cultures, shown separately for each model. * indicates a statistically significant difference between cultures for a specific emotion (at p < 0.05; independent t-test).
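A minimal sketch of this between-participant noise ceiling computation, assuming a label pattern consistent with the worked example above (the exact labels in Table S7 and all variable names are illustrative):

```python
import numpy as np
from collections import defaultdict
from sklearn.metrics import roc_auc_score

# Repeated observations: two stimuli, each labeled by three participants
stim_ids = ["S1", "S1", "S1", "S2", "S2", "S2"]
labels   = ["Anger", "Anger", "Disgust", "Disgust", "Disgust", "Anger"]
classes  = ["Anger", "Disgust"]

# Optimal probabilistic prediction per stimulus: label proportions across repeats
counts = defaultdict(lambda: np.zeros(len(classes)))
for sid, lab in zip(stim_ids, labels):
    counts[sid][classes.index(lab)] += 1
optimal = {sid: c / c.sum() for sid, c in counts.items()}

# The optimal model predicts the same distribution for every repeat of a stimulus
y_true = np.array([classes.index(lab) for lab in labels])
y_pred = np.vstack([optimal[sid] for sid in stim_ids])

# Noise ceiling: performance of the optimal model (binary case, so one AUROC)
nc = roc_auc_score(y_true, y_pred[:, 1])
print(round(nc, 4))  # 0.6667, matching the worked example
```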

Fig. S2.
Results from the ablation analysis, visualized separately for Western (WE) participants (top), East-Asian (EA) participants (middle), and the differences between WE and EA participants (bottom). For ease of interpretation, only the performance-critical AUs (i.e., AUs that decrease performance when ablated) are shown in the top and middle panels.
In the panel with the differences between WE and EA, green cells indicate AUs that are more performance-critical for WE participants than for EA participants, and orange cells indicate AUs that are more performance-critical for EA participants than for WE participants.

Fig. S6. Performance of optimized, culture-agnostic models, split by participant culture. Similar to the results obtained with optimized, culture-specific models (cf. Figure 5C), there are no significant differences in model performance between WE and EA participants for any emotion, with the notable exception of a significantly higher AUROC for WE participants than EA participants for surprise (t = 2.86, p = 0.0069, d = 0.93).

Fig. S8. Performance of the original (non-optimized) models, split by trial decile (x-axis; 1: first 10% of trials, 2: second 10% of trials, etc.). Comparing performance on the first half (deciles 1-5) of the trials with the second half (deciles 6-10) using a paired-samples t-test (across 80 participants in the 'train set', α = 0.05) shows no significant differences for any of the emotions.

Note: The cosine, sigmoid, and linear kernels are measures of similarity, but the Euclidean, L1, and L2 kernels measure distance. For these distance functions, the distances were converted to similarities by inverting them. Only the train data was used to evaluate the different kernels.

Note: The + symbol means that AUs occur together. The ∨ symbol represents "or", so, e.g., (26 ∨ 27) means that either AU26 or AU27 may be included in the configuration. When multiple configurations are explicitly proposed for a given emotion, they are represented as separate bullet points (-). 'None' means that the emotion is represented by an absence of any AU.

Note: 'Panel' refers to the panel (A, B, C, D) of Figure 6 in the main text. For panels 6A and 6C, the statistics for the two listed emotions are the same because AUROC is the same for both classes in a binary (i.e., two-class) classification context. For the statistics in panels 6A and 6C, a one-sample t-test was used against a population mean of 0.5 (chance-level AUROC); for the statistics in panels 6B and 6D, a two-sample t-test was used to test the difference in performance between WE and EA participants.